CS4132 Data Analytics

Effect of Air Quality and Pollution on Health Issues

by Teoh Yu Xin (M21405)

Table of Contents

  1. Motivation and Background
  2. Summary of Research Questions and Results
  3. Dataset
  4. Methodology
    1. Data Acquisition
    2. Data Cleaning
    3. Data Exploration and Analysis (Preliminary Results)
  5. Results
  6. Conclusion and Recommendations
  7. References
  8. Appendix

Motivation and background

Over the years, the state of Earth's environment has been closely monitored and investigated. Air pollution, in particular, is an important environmental factor. Therefore, I would like to investigate the relationship between different types of pollutants (e.g. PSI, SO2, NO2 and other factors affecting air quality) and respiratory-related health outcomes (e.g. increased risk of respiratory diseases, life expectancy, lung cancer). Health outcomes are measured by the number of deaths, or by the number of healthy life years lost, as a result of these air-pollution-related causes; the latter is also known as DALYs (disability-adjusted life years).

Knowing how, and to what extent, air pollution relates to the health of people plays a significant role in coming up with solutions to improve it.

Summary of research questions and Results

  1. How do different kinds of air pollution affect the health of people in the region?

    • Different pollutants possibly have different effects on health: PM2.5 and PM10 show a moderate positive correlation with both the average deaths and the average dalys per 100 000 of the country population. In particular, the AQI of PM2.5 shows a moderately strong positive linear correlation with the average deaths per 100 000 of the country population.
  2. How does worsened air pollution affect the risk of different respiratory related diseases differently?

    • The AQI value of each country relates differently to different death and dalys causes. Some causes show little to no relation, while others, such as air pollution and household pollution as death causes and air pollution as a dalys cause, show a moderate positive correlation (though linear regression may not be the best model).
  3. Does the effect of air pollution in a region affect the health of people of different genders differently?

    • The AQI value appears to affect the health of females slightly less than that of males: for the same countries, females have a lower median average death value than males.
  4. How does the air quality effect on health of people vary from 2014 to 2017?

    • The extent to which the health of people is correlated with the AQI value differs from year to year. For the same increase in AQI, the number of deaths increased the most in 2014, followed by 2015, 2016 and 2017. The number of dalys follows the same order, so both happen to follow chronological order.
  5. How does the air quality effect on health of people vary across different geographical locations?

    • Air pollution may affect the health of people differently in different countries around the world. North and South America have relatively low index values (deaths or dalys per 100 000 of country population divided by AQI value), while countries in Africa have higher index values. Australia stands out with the highest index for dalys of 1.49, and it also has a significantly high index for deaths of 1.04 (slightly lower than that for dalys). Brunei (country code of BRN) has the highest index for deaths of 2.31, indicating a high ratio of deaths per 100 000 of country population to AQI value.
  6. How are individual factors varied over the years? (i.e. Air pollution and emissions, Health causes)

    • Many individual factors have decreased over the years, including the average number of deaths/dalys and the AQI value. Papua New Guinea has the highest average deaths and several African countries have high average dalys. There is a jump from 2016 to 2017 and from 2020 to 2021, with the increase from 2016 to 2017 being sharper than that from 2020 to 2021.

Dataset

This is the list of datasets used in this project, with dataset links found under References Section.

  1. air_pollutant_co2.csv (Singapore Data for CO2 Emissions from fossil fuel combustion)
  2. air_pollutant_lead.csv (Singapore Data for Lead)
  3. air_pollution_exposure.csv (amount of pm2.5 pollutants by country over the years)
  4. aqi_breakpoints.csv (Standard table for AQI Breakpoints of different pollutants)
  5. death-rates-from-air-pollution.csv (number of deaths per 100 000 based on different types of pollution)
  6. disease-burden-by-risk-factor.csv (number of DALYs (disability-adjusted life years) per 100 000 based on different types of pollution)
  7. parameters.csv (standard units for AQI breakpoints)
  8. pneumonia-death-rates-age-standardized.csv (number of deaths per 100 000 for lower respiratory infections)
  9. respiratory-disease-death-rate.csv (number of deaths per 100 000 for chronic respiratory disease)
  10. singstat_subcollation.csv (collation of pollutant emissions in Singapore)
  11. stats_oecd_pollutants.csv (amount of pollutant emissions by year, by country)
  12. who_respiratory_pollution_caused_rate.csv (number of attributed deaths per 100 000 due to different respiratory diseases)
  13. waqi-covid19-airqualitydata-2015H1.csv (AQI Data from first half of 2015)
  14. waqi-covid19-airqualitydata-2016H1.csv (AQI Data from first half of 2016)
  15. waqi-covid19-airqualitydata-2017H1.csv (AQI Data from first half of 2017)
  16. waqi-covid19-airqualitydata-2018H1.csv (AQI Data from first half of 2018)
  17. waqi-covid19-airqualitydata-2019Q1.csv (AQI Data from Quarter 1 of 2019)
  18. waqi-covid19-airqualitydata-2019Q2.csv (AQI Data from Quarter 2 of 2019)
  19. waqi-covid19-airqualitydata-2019Q3.csv (AQI Data from Quarter 3 of 2019)
  20. waqi-covid19-airqualitydata-2019Q4.csv (AQI Data from Quarter 4 of 2019)
  21. waqi-covid19-airqualitydata-2020Q1.csv (AQI Data from Quarter 1 of 2020)
  22. waqi-covid19-airqualitydata-2020Q2.csv (AQI Data from Quarter 2 of 2020)
  23. waqi-covid19-airqualitydata-2020Q3.csv (AQI Data from Quarter 3 of 2020)
  24. waqi-covid19-airqualitydata-2020Q4.csv (AQI Data from Quarter 4 of 2020)
  25. waqi-covid19-airqualitydata-2021.csv (AQI Data from 2021)
  26. https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_future_population (population by country from 1950 to 2050, only used till 2020)
  27. https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes (country codes corresponding to countries)

Methodology

Since the data required is not unique to each research question, data acquisition and data cleaning will be done for all datasets before data exploration and analysis pertaining to each research question.

Data Acquisition

Relevant Imports

Singapore Data

Global Data

Calculation Data

Processing and Reading Raw Data

The next few cells are the printouts for all of the raw datasets.

Data Cleaning

singapore_lead_data is obtained from air_pollutant_lead.csv, which is exported from Singapore Lead Data.

singapore_co2_data is obtained from air_pollutant_co2.csv, which is exported from Singapore CO2 Data.

singapore_subcollation_data is obtained from singstat_subcollation.csv, which contains a collation of other pollutant values and is exported from Singapore Subcollation Data. (Side note: this link does not work if clicked on directly; the actual link, "https://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=14589", has to be pasted into the browser to view the dataset.)

singapore_co2_data and singapore_lead_data are reformatted to match singapore_subcollation_data so that they can be combined into singapore_collated_data, which contains the values for all of the pollutants. Columns are renamed, dropped and converted to appropriate types to make the naming concise and consistent.

The country_population_data_1950_1980, country_population_1985_2015 and country_population_2020_2050 tables are obtained as separate tables from the same website, Country Population Over the Years. Note that these tables give the population values in thousands.

country_code_data gives the country codes corresponding to the respective country names which are obtained from Country Code Data. The 3-letter code is obtained from the table, and then joined with the country_population_data to match the country name to the country code (majority of the other datasets contain the 3-letter code, which is more standardised and easier to compare, and useful for plotting choropleth maps in EDA for the respective research questions).

Manual intervention is needed for the countries stated below, as their country names cannot be matched to a country code because of spelling differences or additional words, and they cannot be matched in any other way.
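A minimal sketch of this kind of manual matching is given here. The country names, both lookup tables and the helper name match_country_code are hypothetical placeholders for illustration; they are not the actual entries corrected in this project.

```python
# Hypothetical lookup of country names to ISO 3166-1 alpha-3 codes.
# In the project this information comes from country_code_data.
code_by_name = {
    "Bolivia (Plurinational State of)": "BOL",
    "Singapore": "SGP",
}

# Hypothetical manual overrides: name used in the population table ->
# name used in the country code table.
manual_name_fixes = {
    "Bolivia": "Bolivia (Plurinational State of)",
}


def match_country_code(name):
    """Return the 3-letter code for a country name, or None if unmatched."""
    cleaned = name.strip()
    canonical = manual_name_fixes.get(cleaned, cleaned)
    return code_by_name.get(canonical)


print(match_country_code("Bolivia"))    # matched through the manual override
print(match_country_code("Singapore"))  # matched directly
print(match_country_code("Atlantis"))   # unmatched names return None
```

Keeping the overrides in one small dictionary makes it easy to see which names were corrected by hand.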

All of the waqi datasets (numbered from 13 to 25 above) are obtained from WAQI (World AQI Data). They are all in the same format and differ only in the period of time covered, so the datasets are appended to each other before data cleaning. The waqi datasets also use the 2-letter country code instead of the 3-letter code, so country_code_data is used to map the 2-letter codes to 3-letter codes for standardisation.
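The append-then-map step can be sketched with pandas as follows. Only the Specie column name comes from the text above; the other column names (Country, median, alpha2, alpha3), the variable names and all values are assumptions made for illustration.

```python
import pandas as pd

# Two tiny stand-ins for the waqi CSV files (illustrative values only).
waqi_part_1 = pd.DataFrame(
    {"Country": ["SG", "AU"], "Specie": ["pm25", "pm25"], "median": [55.0, 20.0]}
)
waqi_part_2 = pd.DataFrame(
    {"Country": ["SG"], "Specie": ["pm10"], "median": [30.0]}
)

# Stand-in for country_code_data (2-letter and 3-letter ISO 3166-1 codes).
country_codes = pd.DataFrame({"alpha2": ["SG", "AU"], "alpha3": ["SGP", "AUS"]})

# Append the files first so that cleaning is done once on the combined table.
waqi_total = pd.concat([waqi_part_1, waqi_part_2], ignore_index=True)

# Map 2-letter codes to 3-letter codes. A left join keeps rows whose code
# could not be matched, so they can be inspected instead of silently dropped.
waqi_total = waqi_total.merge(
    country_codes, how="left", left_on="Country", right_on="alpha2"
).drop(columns="alpha2")

print(waqi_total)
```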

The assumption is made that the value provided under the Specie column of the raw waqi dataset is the AQI value of the pollutant. (A simple visual cross-check between the country PM2.5 values of waqi_data_total and world_oecd_pm25_data_total shows that the values are quite close to each other.)

The function below divides the number of dalys for each country by the respective country's population and scales the result to a value per 100 000 of the population, to mitigate the effect of the country's population size on the number of dalys.
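A minimal sketch of this standardisation, assuming the population is given in thousands as noted for the population tables. The name per_100k and the sample values are illustrative and not taken from the project's code.

```python
def per_100k(total_dalys, population_thousands):
    """Scale a raw DALY count to a rate per 100 000 people.

    population_thousands is the population expressed in thousands,
    matching the unit of the population tables.
    """
    population = population_thousands * 1_000
    return total_dalys / population * 100_000


# Illustrative values only: 12 000 DALYs in a country of 5 000 thousand people.
print(per_100k(12_000, 5_000))
```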

As the data for country populations is only available every 5 years, the percentage growth is used to project the population in a particular year from the nearest available population data. For example, to get the population of a country in 1996, the country's population in 1995 is taken and then multiplied by the percentage growth of the population, as indicated by the '1995 %' column in this case.
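The projection can be sketched as below. This assumes that the '%' column holds an average annual growth rate in percent and that growth compounds yearly; both are assumptions about the table, and the function name and values are illustrative.

```python
def project_population(base_population, annual_growth_percent, years_ahead):
    """Project a population forward from the nearest available year.

    Assumes annual_growth_percent is an average yearly rate in percent
    and that growth compounds once per year.
    """
    growth_factor = (1 + annual_growth_percent / 100) ** years_ahead
    return base_population * growth_factor


# Illustrative values: 1 000 (thousand) people in 1995 growing at 2 % per year,
# projected one year ahead to 1996.
print(project_population(1_000, 2.0, 1))
```

With years_ahead set to 0 the function returns the base population unchanged, so the same call works for years that are present in the table.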

The picture below displays how the AQI value is calculated from the concentration of a certain pollutant based on the standard table of AQI breakpoints. The function calculate_aqi takes in the relevant parameters (row of dataframe, pollutant, pollutant name) and calculates the corresponding AQI value using the dataframe aqi_breakpoints_data_processed.

aqi_breakpoint_calculation_formula.png
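The breakpoint calculation is a linear interpolation within the concentration band that contains the measured value: AQI = (I_high - I_low) / (C_high - C_low) * (C - C_low) + I_low. A minimal standalone sketch is given here; it is not the project's calculate_aqi. The function name is hypothetical and the three PM2.5 rows are illustrative values in the style of the US EPA table, whereas the project reads its breakpoints from aqi_breakpoints.csv.

```python
# Illustrative PM2.5 breakpoint rows: (C_low, C_high, I_low, I_high).
# The real values used in the project come from aqi_breakpoints.csv.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
]


def aqi_from_concentration(concentration, breakpoints):
    """Linearly interpolate an AQI value from a pollutant concentration.

    Returns None when the concentration falls outside every band.
    """
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= concentration <= c_high:
            slope = (i_high - i_low) / (c_high - c_low)
            return slope * (concentration - c_low) + i_low
    return None


print(aqi_from_concentration(6.0, PM25_BREAKPOINTS))
print(aqi_from_concentration(30.0, PM25_BREAKPOINTS))
```

Concentrations that fall in the small gaps between bands (for example 12.05) return None in this sketch; rounding the concentration to the precision of the table before the lookup is one way to avoid that.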

Data Exploration and Analysis (Preliminary Results)

  1. Research Question 1 - Effect of type of air pollutant on health of people
  2. Research Question 2 - Effect of air pollution on death / dalys cause (respiratory-related diseases/air-pollution related)
  3. Research Question 3 - Effect of air pollution on health of different genders
  4. Research Question 4 - Effect of air pollution on health of people from 2014 to 2017
  5. Research Question 5 - Effect of air pollution on health of people across different geographical location
  6. Research Question 6 - Trend of air pollution/health factors across different years

Q1. Effect of type of air pollutant on health of people

Air pollution can be broken down by type of air pollutant, namely particulate matter (PM 2.5 and PM 10), ground-level ozone (O3), carbon monoxide (CO), nitrogen dioxide (NO2) and sulfur dioxide (SO2) in the environment. Many factors can bring about an increase in respiratory-related diseases, so I would like to find out whether, and to what extent, different air pollutants worsen the health of people. Health of people can be defined as the number of cases of respiratory-related diseases or the mortality rate. In this project, the average number of deaths and the average number of dalys (disability adjusted life years) are provided / calculated.

I first collate the level of different pollutants and calculate their air quality index categorised by country, then add up the number of associated deaths due to respiratory diseases. I can compare the countries having the highest AQI / highest average dalys / associated deaths, AQI values for different types of pollutants and observe the relationship of the concentration of pollutants against the number of associated average deaths per 100 000 of population.

Disclaimer: Data for the x axis and y axis span different years, therefore the average is taken to mitigate the effect of year on the results obtained.

The AQI value for each country is obtained and merged with the data containing the average dalys per 100 000 and average deaths per 100 000. The mean of the AQI and the mean of the average dalys are taken so that each country is represented by a single data entry.
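A minimal sketch of this averaging and merging step with pandas. The column names, the placeholder country codes (AAA, BBB, CCC) and all values are made up for illustration.

```python
import pandas as pd

# Made-up yearly AQI values; the years differ from those of the health data.
aqi_by_year = pd.DataFrame(
    {"code": ["AAA", "AAA", "BBB"], "year": [2015, 2016, 2015],
     "aqi": [40.0, 60.0, 100.0]}
)
# Made-up yearly average dalys per 100 000.
dalys_by_year = pd.DataFrame(
    {"code": ["AAA", "AAA", "BBB", "CCC"], "year": [1990, 1991, 1990, 1990],
     "avg_dalys": [300.0, 500.0, 900.0, 100.0]}
)

# Average over years so that each country contributes exactly one row.
mean_aqi = aqi_by_year.groupby("code", as_index=False)["aqi"].mean()
mean_dalys = dalys_by_year.groupby("code", as_index=False)["avg_dalys"].mean()

# An inner join keeps only countries present in both tables.
country_summary = mean_aqi.merge(mean_dalys, on="code", how="inner")
print(country_summary)
```

Countries that appear in only one of the two tables (CCC here) drop out of the merged result, which is worth keeping in mind when reading the scatterplots.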

This graph shows a scatterplot of average dalys per 100 000 against AQI value. It can be noted that the majority of the datapoints cluster at the bottom left corner (low average dalys and low AQI). It can be observed that there is a weak positive correlation that may not necessarily be linear. Further investigation and EDA need to be done to investigate the correlation for each pollutant individually, as it is difficult to tell from this combined scatterplot alone.

This graph shows a scatterplot of average deaths per 100 000 against AQI value. It can be observed that there is a weak positive correlation that may not necessarily be linear. Further investigation and EDA need to be done to investigate the correlation for each pollutant individually.

With a shared x-axis, the regression plots for the pollutants CO, NOX and O3 cannot be shown clearly, so small multiples are used instead to allow the x axis to be scaled separately for each regression plot.

From the plot of small multiples, it can be observed that PM 10 and PM 2.5 display the greatest linear positive correlation among the different pollutants. Pollutants CO, NOX and SOX show little to no positive or negative correlation. Further investigation will be carried out by fitting a linear regression model on these pollutants.

Interestingly, the pollutant O3 seems to show a weak negative correlation, meaning that a higher AQI value (worse air quality) is associated with a lower average number of dalys. This may be because the O3 AQI values are low (ranging from 0 to 35) and so do not have much effect on the dalys, in which case the weak correlation may be insignificant.

As with the dalys plots, a shared x-axis does not show the regression plots for the pollutants CO, NOX and O3 clearly, so small multiples are used instead to allow the x axis to be scaled separately for each regression plot.

From the plot of small multiples, it can be observed that PM 10 and PM 2.5 display the greatest linear positive correlation among the different pollutants. Pollutants CO, SOX and NOX show very weak to negligible positive correlation. Further investigation will be carried out by fitting a linear regression model on these pollutants.

Interestingly, the pollutant O3 seems to show a weak negative correlation, meaning that a higher AQI value (worse air quality) is associated with a lower average number of deaths. This may be because the O3 AQI values are low (ranging from 0 to 35) and so do not have much effect on the deaths, in which case the weak correlation may be insignificant.
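The visual impressions of "moderate", "weak" and "negative" correlation above can be checked numerically with a Pearson correlation coefficient per pollutant. The sketch below uses only the standard library; the data points are made up to show one positive and one negative case and are not the project's values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Made-up (AQI, average deaths per 100 000) points for two pollutants.
points_by_pollutant = {
    "pm25": ([10, 20, 30, 40], [5, 9, 16, 20]),
    "o3": ([5, 10, 15, 20], [8, 7, 7, 6]),
}

for pollutant, (aqi_values, deaths) in points_by_pollutant.items():
    print(pollutant, round(pearson_r(aqi_values, deaths), 2))
```

A coefficient close to 0 supports reading a pollutant as having "little to no correlation", while the sign distinguishes cases like O3 from PM 2.5.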

Q2. Effect of air pollution on death / dalys cause (respiratory-related diseases/air-pollution related)

There are many respiratory-related diseases, such as respiratory infections, pulmonary heart disease and lung cancer, which may be affected by air pollution to different extents. I would like to find out which respiratory-related disease, if any, is the most adversely affected by air pollution (i.e. a small decrease in air quality significantly increases the risk of having that particular disease).

I first collate the level of different pollutants based on country and their respective AQI levels, and merge them to form one single overall pollutant index (mean AQI value), then collate the number of associated deaths due to the respective respiratory diseases. This data is compared against the number of deaths / dalys per 100 000 of population for each cause.

The graph shows the number of deaths per 100 000 of the country population against the AQI value. It is difficult to tell from this scatterplot whether there is any positive correlation, as the points seem to be randomly scattered, with clustering at low AQI and death values. Therefore, each death cause needs to be examined separately to determine whether there is a relationship.

The graph shows the number of dalys per 100 000 of the country population against the AQI value. It is difficult to tell from this scatterplot whether there is any positive correlation, as the points seem to be randomly scattered, with clustering at low AQI and dalys values. Therefore, each dalys cause needs to be examined separately to determine whether there is a relationship.

Small multiples are used to give each death or dalys cause its own linear regression plot. Different causes have different x axis ranges, and the ranges for dalys and deaths also differ significantly, so scaling the x axis to each cause shows the relationship more clearly.

From the graph above, it can be observed that death (air pollution per 100 000), which is the sum of household, pm and ozone, death (household per 100 000), which means deaths associated with household air pollution, and dalys (air pollution per 100 000) show the most positive linear correlation against AQI. These causes are investigated further by fitting a linear regression model to each of them.

The death causes chronic respiratory disease, lower respiratory infection, pm and ozone show a very weak positive correlation between the average number of deaths per 100 000 of country population and AQI.

The death causes chronic obstructive pulmonary disease, ischaemic heart disease and stroke, and the dalys cause pm, show negligible correlation, suggesting that there is no linear relationship between the AQI value and the number of deaths or dalys per 100 000 of country population.

Interestingly, the death cause trachea, bronchus and lung cancers shows a negative correlation (though weakly linear), indicating that an increase in AQI value (higher AQI, worse pollution) is related to a decreased number of deaths per 100 000 of a country's population. This may be due to other factors, like smoking (not taken into account in air pollution), being a greater factor in the deaths associated with trachea, bronchus and lung cancers.

Q3. Effect of air pollution on health of different genders

I want to find out whether air pollution affects different genders differently or equally, and whether there is a significant difference in the extent to which health is adversely affected based on gender alone.

I can first collate the number of males and females respectively per country affected by these respiratory related diseases. Then I can draw appropriate graphical representations for females and males for different respective countries to compare the general relation and difference between genders (if any).

From this FacetGrid, it can be observed that many points are clustered at low AQI values and no positive correlation can be seen. Therefore, no relationship can be seen between the average number of deaths and AQI. However, since the AQI value for each country is the same (i.e. the AQI value for each corresponding datapoint for the same country is the same), a categorical scatterplot can be plotted just for the average deaths per 100 000 for each country for further investigation and analysis.

Based on this graph, it can be observed that the median value for females is slightly lower than that for males. This suggests that deaths of females are less affected by air pollution than deaths of males. Both the range and the interquartile range for females are smaller than for males, showing that the average number of deaths per 100 000 for females is more clustered and spread over a smaller range of values. Both females and males show several outliers above Q3 + 1.5 IQR, showing that there are a few exceptions with extremely high average deaths per 100 000.
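The quantities read off the box plot (median, interquartile range and the Q3 + 1.5 IQR outlier threshold) can be computed directly, as in this standard-library sketch. The function name and the sample values are made up for illustration, and the "inclusive" quartile method is one of several conventions, so results may differ slightly from those of a plotting library.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the median, IQR, upper fence and the points above the fence."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr
    return {
        "median": q2,
        "iqr": iqr,
        "upper_fence": upper_fence,
        "outliers": [v for v in values if v > upper_fence],
    }


# Made-up average deaths per 100 000 for one gender across nine countries.
sample_deaths = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(box_plot_summary(sample_deaths))
```

Running the same function on the female and male values separately gives the numbers needed to compare medians and spreads without relying on the plot alone.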

Q4. Effect of air pollution on health of people from 2014 to 2017

With the improvement of healthcare provision over the years, the health of people is expected to improve. However, with the advancement of society, air pollution has increased. Therefore, I want to find out how the effect of air quality on the health of people has varied across the past few years, and whether the correlation (if any) between the number of respiratory-related diseases and the concentration of air pollutants has become stronger or weaker across different years.

I can first collate the level of different pollutants based on country, and merge them to form one single overall pollutant index, then plot a suitable graphical representation like line graph for each country to see the general trend of pollution index by year as well as overall effect on prevalence of respiratory diseases.

From the graph above, it can be observed that for all years, the number of deaths per 100 000 of population against aqi show a positive correlation (i.e. higher AQI is associated with higher number of deaths per 100 000 of population). However, the linear regression plot for different years can be observed to be slightly different, showing that the extent to which AQI is positively correlated to deaths per 100 000 of population is different. It can also be observed that the datapoints are clustered around the low AQI and low number of deaths per 100 000 population for all years.

From the graph, the year 2014 shows the steepest linear regression line, followed by 2015, 2016 then 2017. This indicates that, for the same increase in AQI, the number of deaths increased the most in 2014, followed by 2015, 2016 and 2017, which happens to follow chronological order.
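Comparing steepness by eye can be backed up by computing the least-squares slope of deaths against AQI for each year. The sketch below uses made-up points chosen only so that the ordering is visible; they are not the project's data, and the function name is illustrative.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up (AQI, deaths per 100 000) points for each year.
points_by_year = {
    2014: ([20, 40, 60], [30, 70, 110]),
    2015: ([20, 40, 60], [30, 60, 90]),
    2016: ([20, 40, 60], [30, 50, 70]),
    2017: ([20, 40, 60], [30, 40, 50]),
}

slopes = {
    year: least_squares_slope(aqi_values, deaths)
    for year, (aqi_values, deaths) in points_by_year.items()
}
steepest_first = sorted(slopes, key=slopes.get, reverse=True)
print(slopes)
print(steepest_first)
```

A larger slope means more additional deaths per 100 000 for each unit increase in AQI, which is the "extent of effect" being compared across years.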

To further observe the individual trends for the different years, a FacetGrid is plotted for each year.

From the graphical representation above, it can be seen that the datapoints for all years are clustered at the lower AQI and lower average deaths per 100 000 population value. In 2016 and 2017, there are a few datapoints with very high AQI values above 120 unlike in 2014 and 2015.

From the graph above, it can be observed that for all years, the number of dalys per 100 000 of population against aqi shows a positive correlation (i.e. higher AQI is associated with a higher number of dalys per 100 000 of population). However, the linear regression plot for different years can be observed to be slightly different, showing that the extent to which AQI is positively correlated to dalys per 100 000 of population is different.

From the graph, the year 2014 shows the steepest linear regression line, followed by 2015, 2016 then 2017. This indicates that, for the same increase in AQI, the number of dalys increased the most in 2014, followed by 2015, 2016 and 2017, which happens to follow chronological order.

From the graphical representation above, it can also be observed that the datapoints are clustered around the low AQI and low number of dalys per 100 000 population for all years, though with a larger variation of dalys per 100 000 of the population at the lower AQI values.

In 2016 and 2017, there are a few datapoints with very high AQI values above 120 unlike in 2014 and 2015.

To show how the "extent of effect" (average deaths per 100 000 population relative to AQI) changes over the years, a geographical representation is plotted with animation, so that the same country can be followed over the years and countries can be compared directly with one another.

In 2014, the country with country code ZAF (Republic of South Africa) shows a significantly darker shade, with an index of 107.12, which is calculated as the normalized number of deaths per 100 000 divided by the normalized AQI value. The index of 107.12 is exceptionally high, as the typical range of index values is 0 to 10 for this graph. The country with country code BOL (Bolivia) also shows a high index of 24 (though not as high as ZAF).

In 2015, the countries with country codes ZAF, SRB, ARG and BOL (Republic of South Africa, Serbia, Argentina and Bolivia respectively) show a significantly darker shade, with ZAF having an index of 10.64, BOL having an index of 6.90, SRB having an index of 8.52 and ARG having an index of 6.71.

In 2016, the country with country code ZAF remains the country with the darkest shade, having an index of 15.06.

In 2017, the country with country code ZAF remains the country showing a significantly darker shade as compared to the other countries, with an index of 13.46.

A higher index (darker colour shade) indicates that there is a high average number of deaths for the same AQI compared to other countries.

In 2014, the country with country code ZAF (Republic of South Africa) shows a significantly darker shade, with an index of 93.31, which is calculated as the normalized number of deaths per 100 000 divided by the normalized AQI value. Other countries with a significantly darker shade (though not as high as ZAF) include RUS (Russia) and BOL (Bolivia), with indices of 15.98 and 18.95 respectively.

In 2015, the country with country code SRB (Serbia) shows a significantly darker shade, with an index of 26.77.

In 2016, the countries with country codes BGR, ZAF, HUN and HRV (Bulgaria, Republic of South Africa, Hungary and Croatia respectively) show a darker shade, with indices of 20.13, 13.20, 16.86 and 13.77 respectively.

In 2017, the countries with country codes LTU and ZAF (Lithuania and Republic of South Africa) show a significantly darker shade as compared to the other countries, with indices of 12.32 and 21.07 respectively.

A higher index indicates that there is a high average number of deaths for the same AQI compared to other countries.

Q5. Effect of air pollution on health of people across different geographical location

The relationship (if any) between the number of respiratory-related diseases and the concentration of air pollutants may be stronger or weaker depending on the geographical location. Therefore, I would like to find out if this is true and how large the variation is.

I collate the level of different pollutants based on country and merge them to form one single overall pollutant index, then sum the total number of people affected by respiratory-related diseases and divide it such that it is a value per 100 000 of country population. By grouping the countries and calculating the ratio of the number of deaths / dalys per 100 000 of country population over the AQI value, we can observe which countries (based on geographical positions) have the highest index. A high index indicates that a country has a high number of deaths / dalys for a low AQI value, which may indicate that its citizens' health is heavily affected by air pollution.

For this question, the AQI value used is not the average AQI value obtained from waqi_data_total. Instead, world_oecd_pm25_data_total, which contains the pm2.5 value for the different countries, is used because:

  1. The number of countries covered by waqi_data_total is insufficient to give a good representation of the entire world map
  2. From previous research questions, it can be observed that the pm2.5 aqi value correlates the most with air pollution related deaths and lives lost (dalys), therefore it is an appropriate measure to use in this question

To investigate the extent to which air pollution affects the average number of deaths and average number of lives lost (dalys) due to air pollution related causes, an index is calculated as AVERAGE DEATHS or DALYS per 100 000 / AQI value. Min-max data normalisation is first applied to the respective variables (average deaths / dalys and aqi) to scale each to a value between 0 and 1 before the ratio is taken.
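The index can be sketched in plain Python as below: each variable is min-max normalised to the range 0 to 1 and the ratio is then taken. The function names and sample values are made up, and returning None for a normalised AQI of 0 is an assumption of this sketch; how the project handles that case is not stated.

```python
def min_max_normalise(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def effect_index(deaths_per_100k, aqi_values):
    """Normalised deaths divided by normalised AQI, one value per country."""
    norm_deaths = min_max_normalise(deaths_per_100k)
    norm_aqi = min_max_normalise(aqi_values)
    # The country with the lowest AQI normalises to 0, so its ratio is
    # undefined; this sketch returns None for it (an assumption).
    return [d / a if a > 0 else None for d, a in zip(norm_deaths, norm_aqi)]


# Made-up values for four countries.
deaths = [10.0, 40.0, 70.0, 100.0]
aqi = [20.0, 25.0, 60.0, 100.0]
print(effect_index(deaths, aqi))
```

Because the denominator is a normalised value, a country whose AQI is only slightly above the minimum can receive a very large index even with moderate deaths, which may be worth checking when a single country stands out on the map.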

From this geographical graph, it can be seen that North and South America have a relatively low index, as shown by the light colour. Countries in Africa show a darker shade of colour, indicating that the index is relatively higher.

From the graph, it can also be seen that Australia (country code of AUS) has the highest index for dalys of 1.48, indicating that it has a high ratio of dalys per 100 000 of country population to AQI value.

From this geographical graph, it can be seen that North and South America have a relatively low index, as shown by the light colour. Countries in Africa show a darker shade of colour, indicating that the index is relatively higher.

From the graph, it can also be seen that Brunei (country code of BRN) has the highest index for deaths of 2.31, indicating it has a high number of deaths per 100 000 of country population to AQI index ratio.

Australia (country code of AUS) still has a significantly high index (though slightly lower than that for dalys) of 1.04.

Q6. Trend of air pollution/health factors across different years

This shows the average deaths per 100 000 of country population over the years. By playing the animation, the colour shade of each country can be observed to change. When dragging the animation bar over the years, it can be observed that in general, the colours turned a darker shade of blue / lighter shade of red as the years progress, indicating that the average number of deaths per 100 000 of the country population has been decreasing progressively.

It can also be noted that Papua New Guinea (country code of PNG) has stayed the country with the highest average deaths per 100 000 of population over the years, having the darkest shade of red while the majority of the other countries remain different shades of blue.

This shows the average dalys per 100 000 of country population over the years. By playing the animation, the colour shade of each country can be observed to change. When dragging the animation bar over the years, it can be observed that in general, the colours turned a darker shade of blue / lighter shade of red as the years progress (a more obvious change in shade as compared to average deaths), indicating that the average number of dalys per 100 000 of the country population has been decreasing.

It can also be noted that several African countries have high average dalys per 100 000 in 1990, with the countries SSD (South Sudan), TCD (Chad), AGO (Angola), GNQ (Equatorial Guinea), NER (Niger), GIN (Guinea), SLE (Sierra Leone) and other countries like AFG (Afghanistan), KHM (Cambodia), LAO (Laos) having average dalys of above 20k.

The graph below has been scaled to a y range of 0 to 30k. With auto scaling, the y axis would extend up to 80k in the earlier years because of outliers from the pollutant CO, which makes it hard to observe the trend in emissions for the lower values, as they are all clustered too close together.

From the animation above, it can be observed that the number of outliers (able to see distinct points for amount of emissions above the cluster) decreases over the years.

Generally, the pollutant CO has the largest range of emission amounts. By hovering the mouse over the points, it can be observed that the USA (United States of America) is one of the countries with the highest emissions for all the different pollutants (from figure 6.3.1).

Majority of the pollutants are clustered in the 0k to 5k kilotons of emissions.

From the graph above, it can be observed that the pollutant PM 2.5 has more spread out AQI values, followed by PM10, while O3, NOX and CO are mostly clustered at very low AQI values ranging between 0 to 50.

In 2014, it is evident that there is an outlier for pollutant PM 2.5 of extremely high AQI value of 338.5 for the country Denmark (country code of DNK) and an outlier for pollutant PM10 of high AQI value of 166.875 for the country India (country code of IND).

From the graph above, it can be observed that the death causes ozone and pm generally have a lower number of deaths per 100 000, with the majority of datapoints clustered between 0 and 50 deaths per 100 000 of country population for ozone, and between 0 and 150 deaths per 100 000 of country population for pm, over the years. For the causes air pollution, chronic respiratory disease, household and lower respiratory infection, the number of deaths per 100 000 has a larger range.

For the chronic respiratory disease cause, there is one consistent outlier from 1995 to 2016, Papua New Guinea (PNG), which remains the country with the highest number of deaths. It can also be observed that Papua New Guinea has the highest number of deaths for the causes air pollution and household over the years.

From the heatmap above, it can be observed that the average number of deaths per 100 000 decreased for air pollution (which improved the most), chronic respiratory disease, lower respiratory infection, household and pm (which improved only slightly). Ozone already has a low average number of deaths from 1990.

It can also be observed that the death cause that contributes the highest average number of deaths is air pollution, followed by chronic respiratory disease, lower respiratory infection, household, PM then ozone.

From the graph above, it can be observed that the range of number of dalys per 100 000 decreases over the years, with the datapoints clustering to the smaller values. It can also be observed that the cause PM has a smaller range of number of dalys per 100 000 value as compared to air pollution and household, solid fuel causes.

Egypt (country code EGY) remains the country with the highest dalys per 100 000 for the cause PM.

From 2005 onwards, Chad (country code TCD) remains the country with the highest dalys per 100 000 for the cause air pollution.

From 2006 onwards, Chad (country code TCD) remains the country with the highest dalys per 100 000 for the cause of household, solid fuel.

From the heatmap above, it can be observed that the average number of dalys per 100 000 decreased over the years for all causes: air pollution improved the most, followed by household, solid fuel, then PM (which improved only slightly). PM already has a low average number of dalys from 1990.

It can also be observed that the dalys cause that contributes the highest average number of dalys is air pollution, followed by household, solid fuel then PM.

From the graph above, it can be observed that for North and South America, the colour shade turns lighter over the years, indicating that the AQI value decreased over the years. As for the African continent, the majority of the countries continue to have high AQIs over the years.

Australia (country code AUS) and New Zealand (country code NZL) remain light coloured from the start, having the lowest values of 15.83 and 22.90 respectively in 1990 all the way to 2011, and remain among the countries with the lowest AQI from 2011 to 2019.

USA, Canada (country code CAN), Finland (country code FIN), Sweden (country code SWE) are among the countries that have the best improvement in AQI over the years, with the largest change in shades of color from dark blue to light green.

The graph above shows the average number of deaths per 100 000 of population over the years. It can be observed clearly that there is a decreasing linear trend for the average deaths per 100 000 of the population over the years, indicating that the number of average deaths per 100 000 is decreasing as time progresses.

The graph above shows the average number of dalys per 100 000 of population over the years. It can be observed clearly that there is a decreasing linear trend for the average dalys per 100 000 of the population over the years, indicating that the number of average dalys per 100 000 is decreasing as time progresses.

The graph above shows a general decreasing trend for the AQI value over the years. From 2014 to 2015, the translucent error bars are larger, showing that there is a greater range of AQI values for the different countries. Interestingly, there is a sudden increase in AQI values from 2016 to 2017 and from 2020 to 2021, with the increase from 2016 to 2017 being a sharper increase than from 2020 to 2021.

Results Discussion

Results

  1. Research Question 1 - Effect of type of air pollutant on health of people
  2. Research Question 2 - Effect of air pollution on death / dalys cause (respiratory-related diseases/air-pollution related)
  3. Research Question 3 - Effect of air pollution on health of different genders
  4. Research Question 4 - Effect of air pollution on health of people from 2014 to 2017
  5. Research Question 5 - Effect of air pollution on health of people across different geographical location
  6. Research Question 6 - Trend of air pollution/health factors across different years

The function below results_linregmodel shows a full display of graphs for a single linear regression model given the inputs - dataframe, x-axis, y-axis and title as subplots. The first subplot displayed will be the normal linear regression model, with indication of the test data and train data. The second subplot will be a bargraph comparing the fitted values vs their actual values to see how well they match to each other / how far the disparity is. The third and fourth subplots are a residual plot and KDE plot respectively for the actual vs fitted values to evaluate how suitable a single linear regression model is. At the side of the graphs, the relevant key variables are displayed, including the slope, P value, R value (correlation), MSE and R^2 value.
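As a rough illustration of the key variables such a function reports, here is a minimal sketch using only the Python standard library. It is not the notebook's implementation: the name `fit_simple_linreg` and the sample numbers are hypothetical, the P value and the train/test split are left out, and the MSE is computed on all points rather than on test data only.

```python
import math


def fit_simple_linreg(x, y):
    """Least-squares fit of y = slope * x + intercept with basic diagnostics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation (the "R value")

    fitted = [slope * xi + intercept for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]  # plotted in a residual plot
    mse = sum(e ** 2 for e in residuals) / n
    return {
        "slope": slope,
        "intercept": intercept,
        "r": r,
        "r_squared": r ** 2,
        "mse": mse,
        "fitted": fitted,
        "residuals": residuals,
    }


# Hypothetical AQI values and deaths per 100 000, for illustration only
aqi = [20.0, 35.0, 50.0, 65.0, 80.0]
deaths = [10.0, 18.0, 22.0, 35.0, 40.0]
result = fit_simple_linreg(aqi, deaths)
print(result["slope"], result["r"], result["r_squared"], result["mse"])
```

A residual plot draws `residuals` against `x`, and the KDE comparison overlays the distributions of `y` and `fitted`; a suitable linear model gives residuals scattered evenly around zero and two KDE curves that nearly coincide.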

The function below results_multilinregmodel shows a full display of graphs for a multi linear regression model given the inputs - dataframe, x-axis list, y-axis and title as subplots. Since for multi linear regression models, it is difficult to plot the actual graph, only two subplots are plotted to evaluate the suitability of a multi linear regression model. The first and second subplot display the barchart of the actual vs fitted values and KDE plot of the actual vs fitted values respectively. At the side, the equation of the line is stated.
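To make the idea of fitting several pollutants at once concrete, the sketch below solves the least-squares normal equations by hand with the standard library. It is an illustration under assumptions, not the notebook's code: the helper names and the sample data are hypothetical, and a library such as scikit-learn would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are left unchanged
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


def fit_multi_linreg(rows, y):
    """Return [intercept, b1, b2, ...] minimising the squared error."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept term
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, row):
    """Fitted value for one observation."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))


# Hypothetical (PM2.5 AQI, PM10 AQI) pairs; y is built as 1 + 2*a + 3*b
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in rows]
coeffs = fit_multi_linreg(rows, y)
fitted = [predict(coeffs, r) for r in rows]
print(coeffs)
```

The list `fitted` is what gets compared with the actual values `y` in the bar chart and the KDE plot.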

Q1. Research Question 1 - Effect of type of air pollutant on health of people

Following figure 1.3.2, further investigation is done by fitting a linear regression model for the pollutant PM2.5, with average dalys per 100 000 against AQI. The P value is very small while the R value (correlation value) is 0.5677, indicating a moderate positive correlation. From the residual plot, it can be observed that the points are roughly randomly scattered around the x axis, although some points lie very high above it. The KDE curves of the fitted and actual values do not coincide very closely, therefore a linear regression model may not be the best model.

However, it is still evident that there is moderate correlation between average dalys per 100 000 against AQI for pollutant PM2.5.

Following figure 1.3.2, further investigation is done by fitting a linear regression model for the pollutant PM10, with average dalys per 100 000 against AQI. The P value is small (0.00085) while the R value (correlation value) is 0.44091, indicating a moderate positive correlation (slightly weaker than for pollutant PM2.5). From the residual plot, it can be observed that the points are roughly randomly scattered around the x axis, although some points lie very high above it. The KDE curves of the fitted and actual values do not coincide and are very far apart, therefore a linear regression model is not a suitable model.

With the R^2 value being 0.1944, a linear regression model is not suitable for the graph of average dalys per 100 000 against AQI for PM10.

Following figure 1.3.2, further investigation is done by fitting a linear regression model for the pollutant PM2.5, with average deaths per 100 000 against AQI. The P value is very small while the R value (correlation value) is 0.68594, indicating a moderately strong positive correlation. From the residual plot, it can be observed that the points are roughly randomly scattered around the x axis, with very few points lying very high above it. The KDE curves of the fitted and actual values coincide quite well, therefore a linear regression model may be a suitable model.

Therefore, we can conclude that the graph of average deaths per 100 000 against AQI for pollutant PM2.5 has a moderately strong linear correlation.

Following figure 1.3.2, further investigation is done by fitting a linear regression model for the pollutant PM10, with average deaths per 100 000 against AQI. The P value is very small while the R value (correlation value) is 0.63207, indicating a moderately strong positive correlation. From the residual plot, it can be observed that the points are roughly randomly scattered around the x axis, with very few points lying very high above or below it. The KDE curves of the fitted and actual values do not coincide well, therefore a linear regression model may not be a suitable model.

However, we can conclude that the graph of average deaths per 100 000 against AQI for pollutant PM10 has a moderately strong positive correlation.
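For a single linear regression with an intercept, the R^2 value equals the square of the R value, so the correlations reported above can be converted directly. The short sketch below does this for the four R values quoted in this section (the dictionary labels are just illustrative names); the PM10 dalys case reproduces the stated R^2 of 0.1944.

```python
# R values quoted in the text for the four single linear regression models
reported_r = {
    "PM2.5, dalys": 0.5677,
    "PM10, dalys": 0.44091,
    "PM2.5, deaths": 0.68594,
    "PM10, deaths": 0.63207,
}

# For simple linear regression, R^2 is the square of the correlation R
r_squared = {name: round(r ** 2, 4) for name, r in reported_r.items()}

for name, value in r_squared.items():
    print(f"{name}: R^2 = {value}")
```

Even the strongest of these correlations explains less than half of the variance in the health measure, which is consistent with the caution above about how well a straight line fits.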

A multi linear regression model is fitted on the AQI for PM2.5 and PM10 to investigate whether the combined trend will have a better linear correlation for average dalys per 100 000. From the bar chart and KDE plot, it can be observed that the actual and fitted values do not coincide very well though the general trend follows, therefore a multi linear regression model may not be suitable.

A multi linear regression model is fitted on the AQI for NOX, SOX, O3 and CO to investigate whether the combined trend will have a better linear correlation for average dalys per 100 000. From the bar chart and KDE plot, it can be observed that the actual and fitted values do not coincide very well, therefore a multi linear regression model is not suitable.

A multi linear regression model is fitted on the AQI for NOX, SOX, O3 and CO to investigate whether the combined trend will have a better linear correlation for average deaths per 100 000. From the bar chart and KDE plot, it can be observed that the actual and fitted values do not coincide very well, therefore a multi linear regression model is not suitable.

A multi linear regression model is fitted on the AQI for PM2.5 and PM10 to investigate whether the combined trend will have a better linear correlation for average deaths per 100 000. From the bar chart and KDE plot, it can be observed that the actual and fitted values do not coincide very well, therefore a multi linear regression model may not be suitable.

A multi linear regression model is fitted on the AQI for all pollutants to investigate whether the combined trend will have a better linear correlation for average deaths per 100 000. From the bar chart and KDE plot, it can be observed that the actual and fitted values do not coincide very well, therefore a multi linear regression model may not be suitable.

A multi linear regression model is fitted on the AQI for all pollutants to investigate whether the combined trend will have a better linear correlation for average dalys per 100 000. From the bar chart and KDE plot, it can be observed that the actual and fitted values do not coincide well, therefore a multi linear regression model is not suitable.

A multi linear regression model is fitted on the AQI for pollutants with non negative correlation (PM2.5, PM10, NOX, SOX and CO) to investigate whether the combined trend will have a better linear correlation for average deaths per 100 000. From the bar chart and KDE plot, it can be observed that the actual and fitted values do follow a similar trend, though they do not coincide well at the break, therefore a multi linear regression model may not be suitable.

A multi linear regression model is fitted on the AQI for pollutants with non negative correlation (PM2.5, PM10, SOX, NOX and CO) to investigate whether the combined trend will have a better linear correlation for average dalys per 100 000. From the bar chart and KDE plot, it can be observed that the actual and fitted values do not coincide well, therefore a multi linear regression model is not suitable.

All single linear regression models (SLRM) below show moderate positive correlation; MLRM denotes a multi linear regression model. This is expected, as a greater AQI value (i.e. worsened air quality) brings more respiratory related / pollution related health problems, and therefore an increase in average deaths / dalys per 100 000 of country population.

  1. Suitable positive linear correlation model
    1. SLRM Average deaths per 100 000 against AQI for pollutant PM 2.5
  2. May not be suitable linear correlation model (show some common trend for KDE though does not coincide fully)
    1. SLRM Average dalys per 100 000 against AQI for pollutant PM 2.5
    2. SLRM Average deaths per 100 000 against AQI for pollutant PM 10
    3. MLRM Average dalys per 100 000 against AQI for pollutant PM2.5 and PM 10
    4. MLRM Average deaths per 100 000 against AQI for pollutant PM2.5 and PM 10
    5. MLRM Average deaths per 100 000 against AQI for all pollutants
    6. MLRM Average deaths per 100 000 against AQI for pollutants PM2.5, PM10, SOX, NOX and CO
  3. Not suitable linear correlation model
    1. SLRM Average dalys per 100 000 against AQI for pollutant PM 10
    2. MLRM Average dalys per 100 000 against AQI for pollutant CO, NOX, SOX and O3
    3. MLRM Average dalys per 100 000 against AQI for all pollutants
    4. MLRM Average dalys per 100 000 against AQI for pollutants PM2.5, PM10, SOX, NOX and CO
    5. MLRM Average deaths per 100 000 against AQI for pollutant CO, NOX, SOX and O3

Q2. Effect of air pollution on death / dalys cause (respiratory-related diseases/air-pollution related)

From figure 1.3.2, further investigation is done by fitting a linear regression model for the death cause air pollution, with average deaths per 100 000 against AQI. The P value is very small while the R value (correlation value) is 0.68018, indicating a moderately strong positive correlation. From the residual plot, it can be observed that the points are roughly randomly scattered around the x axis, though the points are clustered around the x axis for smaller values of AQI. The KDE curves of the fitted and actual values do not coincide well, therefore a linear regression model may not be a suitable model.

However, we can still conclude that the graph of average deaths per 100 000 against AQI for death cause air pollution has a moderately strong positive correlation.

From figure 1.3.2, further investigation is done by performing the linear regression model on the death cause household for average deaths per 100 000 against AQI. The P value is very small while the R value (correlation value) is 0.60683, indicating there is a moderate positive correlation. From the residual plot, it can be observed that the points are roughly randomly separated around the x axis, though the points are clustered around the x axis for smaller values of AQI. The KDE plot of fitted vs actual value follows the same trend though does not coincide nicely, therefore a linear regression model may not be a suitable model.

However, we can still conclude that the graph of average deaths per 100 000 against AQI for death cause household has a moderate positive correlation.

From figure 1.3.2, further investigation is done by fitting a linear regression model for the dalys cause air pollution, with average dalys per 100 000 against AQI. The P value is very small while the R value (correlation value) is 0.68018, indicating a moderately strong positive correlation. From the residual plot, it can be observed that the points are roughly randomly scattered around the x axis, though the points are clustered around the x axis for smaller values of AQI. The KDE curves of the fitted and actual values do not coincide well, therefore a linear regression model may not be a suitable model.

Therefore, although a linear regression model may not be suitable, we can conclude that the graph of average dalys per 100 000 against AQI for dalys cause air pollution has a moderate positive correlation.

All SLRM below show moderate positive correlation.

  1. May not be suitable linear correlation model
    1. SLRM Average deaths per 100 000 against AQI for death cause air pollution
    2. SLRM Average deaths per 100 000 against AQI for death cause household
    3. SLRM Average dalys per 100 000 against AQI for dalys cause air pollution

Q3. Effect of air pollution on health of different genders

Based on the graph above and figure 3.2, it can be observed that the median value for females (4.66) is slightly lower than that for males (5.75). This suggests that deaths of females are less affected by air pollution than deaths of males. Both the range and interquartile range for females (17.62 and 6.49 respectively) are smaller than those for males (24.29 and 8.7175 respectively), showing that the average number of deaths per 100 000 for females is more clustered and spread over a smaller range of values. Both females and males show several outliers above Q3 + 1.5 IQR, showing that there are a few exceptions with extremely high average deaths per 100 000.

For both females and males, Bulgaria (country code BGR) has the highest average number of deaths, at 65.48 and 74.6 respectively. The lower values for females may be due to males being more exposed to air pollution outdoors or, more plausibly, to females already having a longer life expectancy and therefore a lower average number of deaths than males.
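The box plot quantities discussed above (median, interquartile range and the Q3 + 1.5 IQR outlier rule) can be sketched as follows. The function name and the sample values are hypothetical; the "inclusive" quantile method is chosen on the assumption that it matches the linear interpolation that common plotting libraries use by default.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, IQR and the 1.5 * IQR fences used to flag outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical deaths per 100 000 for nine countries, one of them extreme
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
summary = box_plot_summary(sample)
print(summary)
```

Here the quartiles are 3, 5 and 7, so the IQR is 4 and the upper fence is 13; only the value 100 lies beyond it and would be drawn as an outlier point.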

Q4. Effect of air pollution on health of people from 2014 to 2017

From figure 4.1, it can be observed that all years show a positive correlation between average number of deaths per 100 000 and AQI value. The year 2014 shows the steepest linear regression line, followed by 2015, 2016 then 2017. This indicates that, for the same increase in AQI value, the number of deaths increased the most in 2014, followed by 2015, 2016 and 2017. This happens to follow chronological order, which may be due to healthcare improving faster than air quality from 2014 to 2017.

From figure 4.2, in 2016 and 2017, there are a few datapoints with very high AQI values above 120 unlike in 2014 and 2015.

From 2014 to 2017, it can be seen that the datapoints are generally clustered at the lower AQI and lower average deaths per 100 000 population value.

From figure 4.3, it can be observed that all years show a positive correlation between average number of dalys per 100 000 and AQI value. The year 2014 shows the steepest linear regression line, followed by 2015, 2016 then 2017. This indicates that, for the same increase in AQI value, the number of dalys increased the most in 2014, followed by 2015, 2016 and 2017, which happens to follow chronological order.

From figure 4.4, for years 2014 to 2017 the datapoints are clustered around the low AQI and low number of dalys per 100 000 population for all years, however with a larger variation of dalys per 100 000 of the population for the lower AQI values.

In 2016 and 2017, there are a few datapoints with very high AQI values above 120 unlike in 2014 and 2015.
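The year-by-year comparison of regression slopes in this section can be sketched as below: records are grouped by year and a least-squares slope is computed per group. The record layout, function names and numbers are hypothetical and only illustrate the comparison, not the notebook's actual code.

```python
from collections import defaultdict


def slope(x, y):
    """Least-squares slope of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def slopes_by_year(records):
    """Map each year to the slope of the health rate against AQI."""
    grouped = defaultdict(lambda: ([], []))
    for year, aqi, rate in records:
        grouped[year][0].append(aqi)
        grouped[year][1].append(rate)
    return {year: slope(xs, ys) for year, (xs, ys) in sorted(grouped.items())}


# Hypothetical (year, AQI, deaths per 100 000) records
records = [
    (2014, 10.0, 5.0), (2014, 20.0, 25.0), (2014, 30.0, 45.0),
    (2015, 10.0, 5.0), (2015, 20.0, 15.0), (2015, 30.0, 25.0),
]
yearly = slopes_by_year(records)
print(yearly)
```

A larger slope means that the same increase in AQI is associated with a larger increase in the health rate, which is how the steepness of the lines in figures 4.1 and 4.3 is read.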

Q5. Effect of air pollution on health of people across different geographical location

From figure 5.1, it can be seen that different countries have varying index values for average dalys, and that North and South America have a relatively low index, as shown by the light colour. Countries in Africa show a darker shade of colour, indicating that the index is relatively higher.

From the graph, it can also be seen that Australia (country code of AUS) has the highest index for dalys of 1.49, indicating it has a high ratio of dalys per 100 000 of country population to AQI value.

From figure 5.2, it can be seen that North and South America have a relatively low index, as shown by the light colour. Countries in Africa show a darker shade of colour, indicating that the index is relatively higher.

From the graph, it can also be seen that Brunei (country code of BRN) has the highest index for deaths of 2.31, indicating it has a high number of deaths per 100 000 of country population to AQI index ratio.

Australia (country code of AUS) still has a significantly high index (though slightly lower than that for dalys) of 1.04.
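The text describes the index as the ratio of a health rate (deaths or dalys per 100 000) to the AQI value. A minimal sketch under that assumption is shown below; the exact definition used in the notebook may differ, and the function name, country codes and numbers are hypothetical.

```python
def health_to_aqi_index(rate_per_100k, aqi):
    """Ratio of a health rate (deaths or dalys per 100 000) to the AQI value."""
    if aqi is None or aqi <= 0:
        return None  # the ratio is undefined without a positive AQI value
    return rate_per_100k / aqi


# Hypothetical rows: (country code, deaths per 100 000, AQI)
rows = [("AAA", 30.0, 15.0), ("BBB", 40.0, 80.0), ("CCC", 12.0, 0.0)]
index_by_country = {
    code: health_to_aqi_index(rate, aqi) for code, rate, aqi in rows
}
print(index_by_country)
```

Because the AQI is the denominator, a country with a very low AQI can obtain a high index even when its health rate is modest, which is worth keeping in mind when reading the high index values of countries with clean air.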

Q6. Trend of air pollution/health factors across different years

From figure 6.1, in general the average number of deaths per 100 000 of the country population has been decreasing progressively. It can also be noted that Papua New Guinea (country code of PNG) has remained the country with the highest average deaths per 100 000 of population over the years, as compared to other countries which have much lower average deaths per 100 000 of population.

From figure 6.2, in general the average number of dalys per 100 000 of the country population has been decreasing, with the decrease more evident than in figure 6.1.

It can also be noted that several African countries have high average dalys per 100 000 in 1990, with the countries SSD (South Sudan), TCD (Chad), AGO (Angola), GNQ (Equatorial Guinea), NER (Niger), GIN (Guinea) and SLE (Sierra Leone), as well as other countries like AFG (Afghanistan), KHM (Cambodia) and LAO (Laos), having average dalys of above 20k.

From figure 6.3.1 and figure 6.3.2, it can be observed that the number of outliers with very high amount of emissions decreases over the years.

Generally, the pollutant CO has the largest range of amount of emissions. It can also be noted that USA (United States of America) is among the countries with the highest amount of emissions for all the different pollutants. The majority of the pollutants are clustered between 0k and 5k kilotons of emissions.

From figure 6.4, pollutant PM 2.5 has a larger range of AQI values, followed by PM10, while O3, NOX and CO are mostly clustered at very low AQI values ranging between 0 to 50.

In 2014, it is evident that there is an outlier for pollutant PM 2.5 of extremely high AQI value of 338.5 for the country Denmark (country code of DNK) and an outlier for pollutant PM10 of high AQI value of 166.875 for the country India (country code of IND).

From figure 6.5.1, it can be observed that the death causes ozone and pm have lower number of deaths per 100 000 generally. For causes air pollution, chronic respiratory disease, household and lower respiratory infection, the range of number of deaths per 100 000 is larger.

For the chronic respiratory disease cause, there is one consistent outlier from 1995 to 2016, Papua New Guinea (PNG), which remains the country with the highest number of deaths. It can also be observed that Papua New Guinea has the highest number of deaths for the causes air pollution and household over the years.

From figure 6.5.2, the death cause that contributes the highest average number of deaths is air pollution, followed by chronic respiratory disease, lower respiratory infection, household, PM then ozone.

From figure 6.6.1, the range of number of dalys per 100 000 decreases over the years, with majority of countries having small number of dalys per 100 000 of country population. It can also be observed that the cause PM has a smaller range of number of dalys per 100 000 value as compared to air pollution and household, solid fuel causes.

Egypt (country code EGY) remains the country with the highest dalys per 100 000 for the cause PM.

From 2005 onwards, Chad (country code TCD) remains the country with the highest dalys per 100 000 for the cause air pollution. From 2006 onwards, Chad (country code TCD) remains the country with the highest dalys per 100 000 for the cause of household, solid fuel.

From figure 6.6.2, the dalys cause that contributes the highest average number of dalys is air pollution, followed by household, solid fuel then PM.

From figure 6.7, it can be observed that for North and South America, the colour shade turns lighter over the years, indicating that the AQI value decreased over the years. As for the African continent, the majority of the countries continue to have high AQIs over the years.

Australia (country code AUS) and New Zealand (country code NZL) remain light coloured from the start, having the lowest values of 15.83 and 22.90 respectively in 1990 all the way to 2011, and remain among the countries with the lowest AQI from 2011 to 2019.

USA, Canada (country code CAN), Finland (country code FIN), Sweden (country code SWE) are among the countries that have the best improvement in AQI over the years, with the largest change in shades of color from dark blue to light green.

Figure 6.8 shows a clear decreasing linear trend, indicating that the number of average deaths per 100 000 is decreasing as time progresses.

Figure 6.9 shows a clear decreasing linear trend, indicating that the number of average dalys per 100 000 is decreasing as time progresses.

Figure 6.10 shows a general decreasing trend for the AQI value over the years. From 2014 to 2015, there is a greater range of AQI values for the different countries. Interestingly, there is a sudden increase in AQI values from 2016 to 2017 and from 2020 to 2021, with the increase from 2016 to 2017 being sharper than that from 2020 to 2021. The sudden change in AQI value from 2016 to 2017 may be accounted for by the initial commencement of the Paris Climate Agreement in 2016, and the increase in AQI from 2020 to 2021 by the slight reopening of the economy after the COVID-19 pandemic in 2020.

Testing and Verification of Results

Testing and verification of results is done for the linear regression models plotted for Research Questions 1 and 2 by plotting residual and KDE plots to check whether the linear regression models are suitable for the dataset, and by plotting bar charts in addition to KDE plots for the multi linear regression models.

Different representations of the same data are also done to observe the data from different visual points of view. For example, a heatmap and stripplot were used to investigate the same data of how the average number of deaths/dalys per 100 000 of country population for different deaths/dalys causes vary over the different years.

Conclusion and recommendations

In conclusion, there are many factors of air pollution that possibly affect different aspects of the health of people.

These factors include the type of pollutant, the year and the geographical location, while the aspects of health include the different death / dalys causes and gender. The extent to which each factor affects health, and to which each aspect is affected, varies from pollutant type to pollutant type and from death cause to death cause. These trends show a correlation and may not necessarily indicate direct causation, as there are many factors in play. However, the positive correlation between many of the pollutants and health outcomes still signals the importance of monitoring the levels of air pollution, which can play a role in improving the health of people across the world.

It is heartening to note that the number of deaths/dalys as well as AQI values have decreased over the years, and we should therefore continue these efforts, not only protecting people but also protecting the world with improved control over air pollution.

Further investigation can be done on other factors, including types of air pollution (i.e. industrial air pollution, household air pollution, pollution from forest burning). Alternative models, such as logarithmic, exponential or quadratic functions, can also be used to investigate whether the data can be better fitted where a linear regression model is not appropriate.

References

The numbered list of datasets below are hyperlinked to the website for which the respective csv data is obtained.

  1. air_pollutant_co2.csv
  2. air_pollutant_lead.csv
  3. air_pollution_exposure.csv
  4. aqi_breakpoints.csv
  5. death-rates-from-air-pollution.csv
  6. disease-burden-by-risk-factor.csv
  7. parameters.csv
  8. pneumonia-death-rates-age-standardized.csv
  9. respiratory-disease-death-rate.csv
  10. singstat_subcollation.csv (actual link needs to be pasted in to work: "https://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=14589")
  11. stats_oecd_pollutants.csv
  12. who_respiratory_pollution_caused_rate.csv
  13. waqi-covid19-airqualitydata-2015H1.csv
  14. waqi-covid19-airqualitydata-2016H1.csv
  15. waqi-covid19-airqualitydata-2017H1.csv
  16. waqi-covid19-airqualitydata-2018H1.csv
  17. waqi-covid19-airqualitydata-2019Q1.csv
  18. waqi-covid19-airqualitydata-2019Q2.csv
  19. waqi-covid19-airqualitydata-2019Q3.csv
  20. waqi-covid19-airqualitydata-2019Q1.csv
  21. waqi-covid19-airqualitydata-2020Q1.csv
  22. waqi-covid19-airqualitydata-2020Q2.csv
  23. waqi-covid19-airqualitydata-2020Q3.csv
  24. waqi-covid19-airqualitydata-2020Q1.csv
  25. waqi-covid19-airqualitydata-2021.csv
  26. https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_future_population
  27. https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes

Appendix

After joining the waqi_data_total to the dataset, it can be observed that there are a few columns where the AQI for the respective countries is not present. An attempt was made to fill the remaining AQI data with only PM 2.5 data from world_oecd_pm25_data_total. However, the data turned out to be skewed, with the AQI derived from PM 2.5 alone deviating a lot from the other AQI values, so it was not taken as a valid result.

As this animation graph uses the same y axis for all causes, the data cannot be observed clearly for certain causes of health, making it an unsuitable representation.

Singapore collated data is not used in the final project after further consideration of the research questions.

Originally, it was assumed that the values provided by the WAQI website were the breakpoint values, therefore a revised AQI value was computed using those values as breakpoints. However, when the same EDA was performed on the revised AQI, it was noted that the graphs gave very contrasting values and results. In addition, the website mentioned "All air pollutant species are converted to the US EPA standard (i.e. no raw concentrations)", which left it unclear whether the value provided was the breakpoint (concentration) value or the AQI value for the respective pollutant. Therefore, the final assumption made was that the value provided is already the AQI value.

Note that the raw waqi datasets also contain the type "aqi" in the Specie column. It was assumed to be the mean AQI of all the different "Specie" values, which supports the assumption that the values for the respective pollutants are the AQI values for each pollutant type.
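For context, converting a raw concentration into an AQI value with breakpoints uses linear interpolation within a band: I = (I_high - I_low) / (C_high - C_low) * (C - C_low) + I_low. The sketch below illustrates this with the pre-2024 US EPA 24-hour PM2.5 bands as an example table; it is not taken from aqi_breakpoints.csv, whose contents may differ, and rounding the concentration to one decimal place is a simplification made here.

```python
# Example table only: pre-2024 US EPA 24-hour PM2.5 bands (ug/m3)
# Each entry is (C_low, C_high, I_low, I_high)
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def concentration_to_aqi(concentration, breakpoints=PM25_BREAKPOINTS):
    """Linearly interpolate an AQI value from a raw concentration."""
    c = round(concentration, 1)  # simplification so values fall inside a band
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= c <= c_high:
            aqi = (i_high - i_low) / (c_high - c_low) * (c - c_low) + i_low
            return round(aqi)
    return None  # concentration lies outside the table


for value in (0.0, 12.0, 35.4, 600.0):
    print(value, concentration_to_aqi(value))
```

If the WAQI values were already AQI values, passing them through such a conversion a second time would distort them, which would be consistent with the contrasting results described above.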